Individual Income Model

Philippe Heymans Smith

2021-02-17

This document is submitted to Acxiom on behalf of Growth Acceleration Partners for your consideration.

1 Problem Statement

The objective is to predict the individual income (PINCP) of a Texas resident, given a sample of the American Community Survey.

The project specifications include two datasets, a data dictionary, and a reference handbook split into 4 chapters and an appendix.

2 Data Overview

Both datasets provided are part of the Public Use Microdata Sample (PUMS) initiative by the American Community Survey Office. This initiative is meant to give public access to most questions and answers on the American Community Survey (ACS), while safeguarding the confidentiality of its respondents. Each file is a sample of 1% of its corresponding ACS dataset, which means that larger margins of error should be expected than on ACS-provided summaries.
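As a rough illustration of why (an aside, not part of the ACS methodology): the standard error of a mean scales with 1/sqrt(n), so estimates computed from a 1% subsample carry standard errors roughly ten times larger than the same estimates on the full file. The record counts below are hypothetical.

```r
# Illustrative arithmetic only: hypothetical record counts
n_full <- 1e6                  # a hypothetical full ACS state file
n_pums <- n_full * 0.01        # its 1% PUMS sample

# ratio between the standard errors of a sample-based and a full-file mean
sqrt(n_full / n_pums)
# 10
```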

2.1 Geographic data

The PUMS files contain, in hierarchical order: region, division, state and Public Use Microdata Area (PUMA). The latter are non-overlapping areas that contain roughly 100,000 people each; due to privacy considerations, this is the lowest level of disaggregation on the file.

3 Data processing

We explicitly load the libraries that will be used in the study.

library(chronicle) # Developed by Philippe Heymans, see its Appendix section for further details 
# remotes::install_github(repo = 'pheymanss/chronicle')
library(data.table)
library(knitr)
library(ggplot2)
library(magrittr)
library(purrr)
library(stringr)
library(readr)
library(skimr)

3.1 Data dictionary parsing

To get a better understanding of the data, we proceed to map the PUMS variable names to their corresponding descriptions. The parsing of the data dictionary was rather troublesome, since the ACS houses two record types carelessly braided together on the same .csv file:

  • The specification for each field, which comprises 5 values:
  1. Record type (NAME or VAL).
  2. Column name.
  3. Column type (numeric or categorical).
  4. Field length.
  5. Description.
  • The specification for each possible field value, which comprises the same values, but additionally the range of values that each entry can take.
  1. Record type (NAME or VAL).
  2. Column name.
  3. Column type (numeric or categorical).
  4. Field length.
  5. Starting Value (first element of the value range).
  6. Ending Value (last element of the value range).
  7. Description.

This makes the VAL record type a more complete mapping of the content of the data, but the NAME record type is more practical for quick checks. We parse them both separately into a list.

dict_filename <- 'Technical Test Data and Information/PUMS_Data_Dictionary_2019.csv'

dictionaries <- readr::read_lines(dict_filename) %>% 
  unique() %>% 
  split(str_detect(., '^NAME')) %>% # identify each dataset
  purrr::set_names(c('values', 'names')) %>% 
  purrr::map2(.y = list(val_columns = c('field_type',
                                        'column_name',
                                        'column_type',
                                        'field_length',
                                        'starting_value',
                                        'ending_value',
                                        'description'),
                        name_columns = c('field_type',
                                         'column_name',
                                         'column_type',
                                         'field_length',
                                         'description')),
              .f = ~data.table::fread(text = c(paste(.y, collapse = ','),
                                               .x)))

Additionally, a flag was added to indicate whether each variable belongs to the Person or the Housing Unit dataset. The first variable of the Person dataset is SPORDER, so any entry before it belongs to the Housing Unit data.

# find the index of SPORDER in both the $names and $values tables
sporder_index <- dictionaries %>% 
  purrr::map(~which(.x$column_name %chin% 'SPORDER'))

# map the corresponding dataset flag to each row
walk2(.x = dictionaries, 
      .y = sporder_index,
      .f = ~.x[, record_number := 1:.N
             ][, dataset := data.table::fifelse(test = record_number < .y,
                                           yes = 'H',
                                           no = 'P')]) 

With this code, this:

## NAME,RT,C,1,"Record Type" 
##  VAL,RT,C,1,"H","H","Housing Record or Group Quarters Unit" 
##  VAL,RT,C,1,"P","P","Person Record" 
##  NAME,SERIALNO,C,13,"Housing unit/GQ person serial number" 
##  VAL,SERIALNO,C,13,"2019GQ0000001","2019GQ9999999","GQ Unique identifier" 
##  VAL,SERIALNO,C,13,"2019HU0000001","2019HU9999999","HU Unique identifier" 
##  NAME,DIVISION,C,1,"Division code based on 2010 Census definitions" 
##  VAL,DIVISION,C,1,"0","0","Puerto Rico" 
##  VAL,DIVISION,C,1,"1","1","New England (Northeast region)" 
##  VAL,DIVISION,C,1,"2","2","Middle Atlantic (Northeast region)" 
##  VAL,DIVISION,C,1,"3","3","East North Central (Midwest region)" 
##  VAL,DIVISION,C,1,"4","4","West North Central (Midwest region)" 
##  VAL,DIVISION,C,1,"5","5","South Atlantic (South region)" 
##  VAL,DIVISION,C,1,"6","6","East South Central (South region)" 
##  VAL,DIVISION,C,1,"7","7","West South Central (South Region)" 
##  VAL,DIVISION,C,1,"8","8","Mountain (West region)" 
##  VAL,DIVISION,C,1,"9","9","Pacific (West region)" 
##  NAME,PUMA,C,5,"Public use microdata area code (PUMA) based on 2010 Census definition (areas with population of 100,000 or more, use with ST for unique code)" 
##  VAL,PUMA,C,5,"00100","70301","Public use microdata area codes" 
##  NAME,REGION,C,1,"Region code based on 2010 Census definitions"

Turns into these:

Column dictionary

field_type column_name column_type field_length description record_number dataset
NAME RT C 1 Record Type 1 H
NAME SERIALNO C 13 Housing unit/GQ person serial number 2 H
NAME DIVISION C 1 Division code based on 2010 Census definitions 3 H
NAME PUMA C 5 Public use microdata area code (PUMA) based on 2010 Census definition (areas with population of 100,000 or more, use with ST for unique code) 4 H
NAME REGION C 1 Region code based on 2010 Census definitions 5 H
NAME ST C 2 State Code based on 2010 Census definitions 6 H

Value dictionary

field_type column_name column_type field_length starting_value ending_value description record_number dataset
VAL RT C 1 H H Housing Record or Group Quarters Unit 1 H
VAL RT C 1 P P Person Record 2 H
VAL SERIALNO C 13 2019GQ0000001 2019GQ9999999 GQ Unique identifier 3 H
VAL SERIALNO C 13 2019HU0000001 2019HU9999999 HU Unique identifier 4 H
VAL DIVISION C 1 0 0 Puerto Rico 5 H
VAL DIVISION C 1 1 1 New England (Northeast region) 6 H

These description fields give no details regarding methodology or definitions; for the latter, refer to the US Census Bureau subject definitions.

And for future convenience, this process can be packaged into a single call:

dict_filename <- 'Technical Test Data and Information/PUMS_Data_Dictionary_2019.csv'
dicts <- parse_dictionaries(dict_filename)

str(object = dicts, give.attr = FALSE, max.level = 2, vec.len = 2)
## List of 2
##  $ values:Classes 'data.table' and 'data.frame': 5368 obs. of  9 variables:
##   ..$ field_type    : chr [1:5368] "VAL" "VAL" ...
##   ..$ column_name   : chr [1:5368] "RT" "RT" ...
##   ..$ column_type   : chr [1:5368] "C" "C" ...
##   ..$ field_length  : int [1:5368] 1 1 13 13 1 ...
##   ..$ starting_value: chr [1:5368] "H" "P" ...
##   ..$ ending_value  : chr [1:5368] "H" "P" ...
##   ..$ description   : chr [1:5368] "Housing Record or Group Quarters Unit" "Person Record" ...
##   ..$ record_number : int [1:5368] 1 2 3 4 5 ...
##   ..$ dataset       : chr [1:5368] "H" "H" ...
##   ..- attr(*, ".internal.selfref")=<externalptr> 
##  $ names :Classes 'data.table' and 'data.frame': 516 obs. of  7 variables:
##   ..$ field_type   : chr [1:516] "NAME" "NAME" ...
##   ..$ column_name  : chr [1:516] "RT" "SERIALNO" ...
##   ..$ column_type  : chr [1:516] "C" "C" ...
##   ..$ field_length : int [1:516] 1 13 1 5 1 ...
##   ..$ description  : chr [1:516] "Record Type" "Housing unit/GQ person serial number" ...
##   ..$ record_number: int [1:516] 1 2 3 4 5 ...
##   ..$ dataset      : chr [1:516] "H" "H" ...
##   ..- attr(*, ".internal.selfref")=<externalptr>

3.2 Data Loading and joining

At 253MB (70MB + 183MB), this dataset should not present issues on any reasonably modern hardware setup, and that is precisely the reason why the PUMS dataset exists. Since the PUMS files are a 1% random sample of the full state dataset, and assuming reasonable stability in the size of the fields, the full dataset would be around 25GB. This would certainly be more demanding, but still not enough to be a limiting factor for the steps carried out in this particular study.

If this were a national study done with the full ACS dataset instead of a Texas-focused PUMS study, then the size of the raw dataset would indeed be a limiting factor, and loading the entire dataset into memory would not be feasible on consumer-grade hardware. Supposing that it is not worth loading the dataset into a distributed filesystem setup, the healthier approach would be to start discarding portions of the data that will not be used in the model. If discarding all those columns were not enough, then sampling the data could suffice to diagnose, re-structure and calibrate the desired model. However, ensuring a representative sample can potentially be more time consuming than the study itself.
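One way to implement that column-discarding approach without ever holding the full file in memory (a sketch with a toy file; the column names are illustrative) is the select argument of data.table::fread, which skips unwanted columns at parse time:

```r
library(data.table)

# toy stand-in for a large PUMS csv
tmp <- tempfile(fileext = '.csv')
fwrite(data.table(SERIALNO = c('2019HU0000001', '2019HU0000002'),
                  AGEP = c(30L, 45L),
                  UNUSED = c(1.5, 2.5)),
       tmp)

# read only the columns of interest; UNUSED is never loaded into memory
sub <- fread(tmp, select = c('SERIALNO', 'AGEP'))
colnames(sub)
# "SERIALNO" "AGEP"
```

For datasets too large even for this, fread's nrows and skip arguments allow processing the file in row chunks.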

Feature selection will be performed further on, but it will be motivated by good statistical practice, not by data size constraints.

We will load both datasets into memory. The column ‘RT’ is discarded because it is an indicator of the dataset, P for Person and H for Housing Unit, which will become meaningless once we merge both files.

# load person data
per <- data.table::fread('Technical Test Data and Information/psam_p48.csv') %>% 
  discard_cols(cols = 'RT')
# 272,776 rows

# load housing unit data
hu <- data.table::fread('Technical Test Data and Information/psam_h48.csv') %>% 
  discard_cols(cols = 'RT')
# 123,710 rows

Since the number of columns in each table is quite extensive, we will first join both tables and then keep only the variables that are both potentially relevant to the study at hand and have a reasonable amount of valid data. Although the documentation suggests joining the tables by ‘SERIALNO’, the records also share several location variables, which will also be used in the join criteria to avoid duplication.

join_cols <- intersect(colnames(per), colnames(hu))
# this is c("SERIALNO", "DIVISION", "PUMA", "REGION", "ST", "ADJINC")

# set data.table keys for fast joining
data.table::setkeyv(per, join_cols)
data.table::setkeyv(hu,  join_cols)

# data.table left join
dat <- data.table::merge.data.table(x = per, y = hu, all.x = TRUE)

# remove the separate tables from memory
rm(per)
rm(hu)

# coerce all character variables to character:
# first, get all the column names in the dictionary that correspond to 'C' column types. 
character_cols <- dictionaries$names[column_type == 'C']$column_name %>% intersect(colnames(dat))

# and then coerce all of them to character by reference
dat[, (character_cols) := purrr::map(.SD, as.character), .SDcols = character_cols]

3.3 Column discarding

The joined table comprises 517 variables, but we can quickly reduce the width of the dataset with three simple ideas:

  • Over 100 of those columns are weight values for computing representative aggregates, which is not part of the scope of this study.
dat <- discard_cols(dt = dat, cols = c('^WGTP', '^PWGTP'), regex = TRUE)
  • We will not discriminate on whether the item comes from a true response or an allocated response.
flag_columns <- dictionaries$names[str_detect(description, 'allocation flag')]$column_name
dat <- discard_cols(dt = dat, cols = flag_columns, regex = TRUE)
  • We can remove columns that have a single constant value.
unique_values <- data.table(column = colnames(dat),
                            unique_values = purrr::map_int(dat, uniqueN)
                            )[unique_values == 1
                            ][, description := describe_column(column)][]

knitr::kable(unique_values)
column unique_values description
DIVISION 1 Division code based on 2010 Census definitions
REGION 1 Region code based on 2010 Census definitions
ST 1 State Code based on 2010 Census definitions
ADJINC 1 Adjustment factor for income and earnings dollar amounts (6 implied decimal places)
ADJHSG 1 Adjustment factor for housing dollar amounts (6 implied decimal places)
VACS 1 Vacancy status

As expected, these columns mostly correspond to geographical entries that are identical for all Texas residents, and adjustment coefficients that are identical for every observation made in the same time period.
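A note on the “6 implied decimal places” wording in those adjustment factors: the stored value is an integer that must be divided by 1e6 before multiplying. A sketch with a hypothetical factor value:

```r
# hypothetical stored adjustment factor, i.e. 1.010145 with 6 implied decimals
adjinc_raw <- 1010145
pincp_raw  <- 50000            # a raw income figure

# adjusting to constant dollars: divide the factor by 1e6 first
pincp_raw * adjinc_raw / 1e6
# 50507.25
```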

dat <- discard_cols(dt = dat, cols = unique_values$column)

With that, we went from 517 to 218 variables, a much more manageable number of columns to consider in our model. We can now proceed with studying the contents of those columns.

3.4 Valid data coverage

Let us first examine the percentage of valid data in all the remaining variables.

na_perc <- na_percentage(dat) %>% 
  data.table::merge.data.table(y = dictionaries$names[, .(column_name, column_type, description)],
                               by.x = 'column',
                               by.y = 'column_name')

na_perc[order(-na_perc)] %>% head(15) %>% knitr::kable()
column na_count na_perc column_type description
GCM 270091 0.9901568 C Length of time responsible for grandchildren
DRAT 267422 0.9803722 C Veteran service connected disability rating (percentage)
SMP 266745 0.9778903 N Total payment on all second and junior mortgages and home equity loans (monthly amount, use ADJHSG to adjust SMP to constant dollars)
FOD2P 266193 0.9758666 C Recoded field of degree - second entry
GCR 265672 0.9739567 C Grandparents responsible for grandchildren
FULP 265544 0.9734874 N Fuel cost (yearly cost for fuels other than gas and electricity, use ADJHSG to adjust FULP to constant dollars)
SFN 262390 0.9619248 C Subfamily number
SFR 262390 0.9619248 C Subfamily relationship
FHINS5C 262009 0.9605281 C TRICARE coverage given through the eligibility coverage edit
MHP 259555 0.9515317 N Mobile home costs (yearly amount, use ADJHSG to adjust MHP to constant dollars)
CITWP 255299 0.9359291 N Year of naturalization write-in
MLPA 254956 0.9346717 C Served September 2001 or later
MLPB 254956 0.9346717 C Served August 1990 - August 2001 (including Persian Gulf War)
MLPCD 254956 0.9346717 C Served May 1975 - July 1990
MLPE 254956 0.9346717 C Served Vietnam era (August 1964 - April 1975)

Most variables with high NA coverage are ones for which some sort of “No data/Does not apply” value exists but is not explicitly stated; instead the field is left empty (a reasonable space-saving decision) and is subsequently read as an NA by R. This means that there is no need to discard most of these variables, since they are easily mapped to the default response ‘b’ = ‘Does not apply’ for categorical variables, while 0 is appropriate for most numerical variables, since those represent costs, payments and counts.

numeric_impute <- dictionaries$names[column_type == 'N']$column_name %>%
  intersect(colnames(dat)) %>% 
  setdiff('PINCP') # our target variable deserves special consideration 

na_fill(dt = dat,  
        cols = numeric_impute,
        replacement = 0)

# for categoricals, only the variables that have an explicit 'b' = 'Does not apply' field will be replaced.
categorical_impute <- dictionaries$names[column_type == 'C']$column_name %>%
  intersect(colnames(dat)) %>% 
  intersect(dictionaries$values[column_type == 'C' & starting_value == 'b']$column_name)

na_fill(dt = dat,  
        cols = categorical_impute,
        replacement = 'b')

Recomputing the valid data coverage metrics, we see much better numbers for most variables, keeping NAs only in the 9 categorical variables that are not mappable to a ‘b’ field, and in our target variable, which we explicitly decided not to impute.

na_perc <- na_percentage(dat) %>% 
  data.table::merge.data.table(y = dictionaries$names[, .(column_name, column_type, description)],
                               by.x = 'column',
                               by.y = 'column_name')

na_perc[na_perc > 0] %>% head(10) %>% knitr::kable()
column na_count na_perc column_type description
BLD 10876 0.0398715 C Units in structure
FOD1P 208086 0.7628457 C Recoded field of degree - first entry
FOD2P 266193 0.9758666 C Recoded field of degree - second entry
HHLANP 10876 0.0398715 C Detailed household language
HHT2 10876 0.0398715 C Household/family type (includes cohabiting)
INDP 113669 0.4167119 C Industry recode for 2018 and later based on 2017 IND codes
JWAP 155057 0.5684408 C Time of arrival at work - hour and minute
JWDP 155057 0.5684408 C Time of departure for work - hour and minute
JWTRNS 147442 0.5405241 C Means of transportation to work
LANP 196995 0.7221860 C Language spoken at home

3.5 Categorical variable decoding

Now, with most variables filled, we can proceed to map them to their true values. The strategy here is to fetch the values specified in the data dictionary, and rewrite the values of each coded column by reference for the minimum memory overhead possible. We first create the list of categorical variables to be decoded:

# select the 'C' variables according to the dictionary
columns_to_decode <- dictionaries$names[column_type %chin% 'C']$column_name 

# do not decode range categorical variables (ie where starting_value != ending_value)
# These are c("SERIALNO", "SPORDER", "PUMA", "MIGPUMA", "POWPUMA", "RACNUM")
exclude_from_decoding <- 
  dictionaries$values[column_type %chin% 'C' & starting_value != ending_value]$column_name

columns_to_decode %<>% setdiff(exclude_from_decoding) 

# and of course, decode only the columns that are still present in the data
columns_to_decode %<>% intersect(colnames(dat))

The following function then rewrites the content of all coded columns into their explicit values. For example, for the ‘AGS’ column (Sales of Agriculture Products): the value ‘1’ will be changed to ‘None’, ‘2’ will be changed to ‘$1 - $999’, and so on.

# and define a function that decodes a column
decode_categoricals <- function(col, dat){
  # select the dictionary entries specifying column values
  column_mapping <- dictionaries$values[column_name %chin% col & starting_value == ending_value,
                                        .(starting_value, description)] %>% 
    data.table::setkey(starting_value)
  
  # data.table 'left join' by reference
  # technically this is a mutate by reference, not a join
  matching_statement <- paste0(col, '==', 'starting_value')
  setkeyv(dat, col)
  dat[column_mapping, on = matching_statement, (col) := i.description]
}

tictoc::tic() 
# apply the function to all categorical columns
columns_to_decode %>% walk(decode_categoricals, dat = dat)
tictoc::toc()
## 11.66 sec elapsed

In this execution time, the function computed the mappings for all 169 coded columns, rewriting over the same pointers, which also means minimal memory overhead.
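The by-reference claim can be verified directly: data.table's := assignment modifies the columns of the existing object rather than creating a modified copy, which data.table::address() makes visible:

```r
library(data.table)

dt <- data.table(code = c('1', '2'), x = 1:2)
addr_before <- address(dt)

# := recodes the column in place; no copy of dt is created
dt[code == '1', code := 'None']

identical(address(dt), addr_before)
# TRUE
```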

With this final step, not only do we have a workable data file that needs no external reference (besides column names, whose descriptions are too long to use as column names directly), but the whole process can also be wrapped into a single function, allowing us to efficiently parse all past and future PUMS files that follow the same structure.

per_fn <- 'Technical Test Data and Information/psam_p48.csv'
hu_fc <- 'Technical Test Data and Information/psam_h48.csv'
dict_filename <- 'Technical Test Data and Information/PUMS_Data_Dictionary_2019.csv'

pums <- pums_for_individual_modelling(person_filename = per_fn,
                                      housing_unit_filename = hu_fc,
                                      dictionary_filename = dict_filename)
## PUMS files successfully parsed: 32.94 sec elapsed

4 Exploratory data analysis

4.1 Target variable analysis

As we have an explicit target variable, we can first center our data exploration on the relationship between each variable and our y variable: PINCP. This is a continuous numerical variable, which means we are dealing with a regression problem. We start off by understanding our variable:

num_summary(dt = dat, col = 'PINCP') %>% knitr::kable()
count distinct_values na_count na_perc sum mean sd min percentile.25 percentile.50 percentile.75 percentile.975 percentile.99 max
222893 7414 49883 0.1828717 9808773094 44006.64 67379.04 -8400 7500 25000 55000 200000 372000 1166600

From these statistics, we can see that the variable reaches negative values, which is a valid data entry if the respondent experienced losses during the study period. There is currently no solid reason to discard those records.

On the other hand, as pointed out in the prior section, 18% of the PINCP fields are empty, and there are records with 0 assigned to that column; hence it would be reckless to conclude that missing data means zero. We will remove those observations from the data.

dat <- dat[!is.na(PINCP)]

We are then left with 222,893 valid individual observations.

And now we can proceed with visualizing our variable. As expected of income variables, its distribution is heavily asymmetrical: strongly right-skewed even with the variable capped at 1,166,600.

chronicle::make_density(dt = dat, value = 'PINCP', static = FALSE)
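If a later modelling step calls for taming this skew, one common option for signed income data (shown here only as a sketch; it is not a step taken in this study) is the inverse hyperbolic sine, which behaves like a logarithm for large magnitudes but remains defined for zero and negative incomes:

```r
# asinh(x) = log(x + sqrt(x^2 + 1)): log-like for large |x|, defined on all reals,
# so negative incomes (losses) need no shifting before transformation
incomes <- c(-8400, 0, 7500, 25000, 1166600)  # values taken from the PINCP summary
asinh(incomes)
```

Unlike log1p, the transform is odd (asinh(-x) = -asinh(x)), so losses map symmetrically to gains.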

Then, we can study the relationship between PINCP and all other numerical variables through a linear correlation table.

# select numeric columns only
nums <- dat %>% purrr::keep(is.numeric) %>% colnames()

# compute correlations against the individual income variable, and count NA values 
income_cor <- data.table(column_name = nums,
                  income_cor = purrr::map_dbl(.x = nums,
                                              .f = ~cor(na.omit(dat[, .(PINCP, get(.x))]))[1,2] %>% round(2)),
                  na_perc = purrr::map_dbl(.x = nums, 
                                           .f = ~mean(is.na(dat[[.x]])))
                  )[, description := describe_column(column_name)]

income_cor[order(-abs(income_cor))] %>% knitr::kable()
column_name income_cor na_perc description
PINCP 1.00 0 Total person’s income (signed, use ADJINC to adjust to constant dollars)
PERNP 0.90 0 Total person’s earnings (use ADJINC to adjust to constant dollars)
WAGP 0.85 0 Wages or salary income past 12 months (use ADJINC to adjust WAGP to constant dollars)
HINCP 0.64 0 Household income (past 12 months, use ADJINC to adjust HINCP to constant dollars)
FINCP 0.54 0 Family income (past 12 months, use ADJINC to adjust FINCP to constant dollars)
POVPIP 0.43 0 Income-to-poverty ratio recode
INTP 0.40 0 Interest, dividends, and net rental income past 12 months (signed, use ADJINC to adjust to constant dollars)
WKHP 0.38 0 Usual hours worked per week past 12 months
WKWN 0.35 0 Weeks worked during past 12 months
TAXAMT 0.35 0 Property taxes (yearly real estate taxes)
SEMP 0.33 0 Self-employment income past 12 months (signed, use ADJINC to adjust SEMP to constant dollars)
SMOCP 0.33 0 Selected monthly owner costs (use ADJHSG to adjust SMOCP to constant dollars)
VALP 0.32 0 Property value
INSP 0.28 0 Fire/hazard/flood insurance (yearly amount, use ADJHSG to adjust INSP to constant dollars)
MRGP 0.26 0 First mortgage payment (monthly amount, use ADJHSG to adjust MRGP to constant dollars)
RMSP 0.23 0 Number of rooms
JWMNP 0.21 0 Travel time to work
MARHYP 0.21 0 Year last married
RETP 0.21 0 Retirement income past 12 months (use ADJINC to adjust to constant dollars)
SPORDER -0.19 0 Person number
JWRIP 0.19 0 Vehicle occupancy
BDSP 0.19 0 Number of bedrooms
GRPIP -0.14 0 Gross rent as a percentage of household income past 12 months
AGEP 0.13 0 Age
ELEP 0.12 0 Electricity cost (monthly cost, use ADJHSG to adjust ELEP to constant dollars)
GASP 0.10 0 Gas cost (monthly cost, use ADJHSG to adjust GASP to constant dollars)
WATP 0.10 0 Water cost (yearly cost, use ADJHSG to adjust WATP to constant dollars)
OCPIP -0.08 0 Selected monthly owner costs as a percentage of household income during the past 12 months
OIP 0.06 0 All other income past 12 months (use ADJINC to adjust to constant dollars)
NP -0.06 0 Number of persons in this household
SMP 0.06 0 Total payment on all second and junior mortgages and home equity loans (monthly amount, use ADJHSG to adjust SMP to constant dollars)
SSP 0.05 0 Social Security income past 12 months (use ADJINC to adjust SSP to constant dollars)
SSIP -0.04 0 Supplementary Security Income past 12 months (use ADJINC to adjust SSIP to constant dollars)
CONP 0.04 0 Condo fee (monthly amount, use ADJHSG to adjust CONP to constant dollars)
GRNTP -0.04 0 Gross rent (monthly amount, use ADJHSG to adjust GRNTP to constant dollars)
YOEP -0.03 0 Year of entry
RNTP -0.03 0 Monthly rent (use ADJHSG to adjust RNTP to constant dollars)
NPF -0.03 0 Number of persons in family (unweighted)
CITWP 0.02 0 Year of naturalization write-in
MHP -0.02 0 Mobile home costs (yearly amount, use ADJHSG to adjust MHP to constant dollars)
NRC -0.02 0 Number of related children in household (unweighted)
PAP -0.01 0 Public assistance income past 12 months (use ADJINC to adjust to constant dollars)
NOC 0.01 0 Number of own children in household (unweighted)
FULP 0.00 0 Fuel cost (yearly cost for fuels other than gas and electricity, use ADJHSG to adjust FULP to constant dollars)

Because most numerical variables are likely to be higher on better-off individuals, it stands to reason that most of them are positively correlated with our income target variable.

We will, however, remove all components of the PINCP variable present in the training data (i.e. all income categories), since those variables sum explicitly to our target variable, and keeping them would defeat the purpose of the exercise.

dat <- discard_cols(dt = dat, cols = c('PERNP', 'WAGP', 'OIP', 
                                       'PAP',  'RETP', 'INTP', 
                                       'SEMP', 'SSIP', 'SSP', 
                                       'WKHP'))

4.1.1 Consistency within housing unit income

It is also necessary to verify that the household income information is consistent with the income of each of its members. The first check should be to verify that all members of the same housing unit report the same household income.

# set key for faster group statistics
data.table::setkey(dat, SERIALNO)

# query all housing units with more than 1 HINCP reported
dat[, .(hu_incomes_reported = uniqueN(HINCP)), by = .(SERIALNO)][hu_incomes_reported > 1]
## Empty data.table (0 rows and 2 cols): SERIALNO,hu_incomes_reported

This returns an empty query, which means there is only a single value reported for each housing unit.

Next, we should verify that the individual income makes sense in the context of their housing unit. One might be inclined to verify that every individual’s income is smaller than the income of the housing unit in which they reside; however, as we saw in the exploration of our target variable, individual income can be negative, so one or more individuals’ incomes can be larger than their household’s income to offset other individuals’ losses.

The correct way to review this consistency, then, is to verify that the sum of the individual incomes (including negative ones) equals the household income.

income_check <- dat[, .(sum_individual = sum(PINCP), HINCP = unique(HINCP)), 
                    keyby = SERIALNO
                  ][, consistent := sum_individual == HINCP]

income_check[, .N, consistent][, perc := N/sum(N)][] %>% knitr::kable()
consistent N perc
TRUE 107632 0.9453179
FALSE 6226 0.0546821

We see that roughly 95% of the households have perfectly consistent income reports. For those that do not, we can review how large the differences are. A quick glance at the summary shows that all the inconsistent data is due to imputed missing values; hence the only household income value in all inconsistent households is 0.

head(income_check[consistent == FALSE]) %>% knitr::kable()
SERIALNO sum_individual HINCP consistent
2019GQ0000043 7200 0 FALSE
2019GQ0000129 13200 0 FALSE
2019GQ0000141 9200 0 FALSE
2019GQ0000171 16400 0 FALSE
2019GQ0000183 26500 0 FALSE
2019GQ0000184 8600 0 FALSE
income_check[consistent == FALSE]$HINCP %>% unique()
## [1] 0

We can then use the computed summary to correct these missing values with their true values.

dat[income_check, on = 'SERIALNO', HINCP := i.sum_individual]

And repeating the same query, we now see that 100% of the housing unit income observations are consistent.

dat[, .(sum_individual = sum(PINCP), HINCP = unique(HINCP)), 
                    keyby = SERIALNO
  ][, consistent := sum_individual == HINCP
  ][, .N, consistent][, perc := N/sum(N)][] %>%  knitr::kable()
consistent N perc
TRUE 113858 1

4.2 Numerical variables

We can now turn our attention to the numerical variables in general. The skim() function from the {skimr} package lets us quickly skim through all of them. Categorical variables are also supported by the function, but are omitted here to avoid a very large table output.

# create a copy of dat with a snippet of the variable description pasted into each column name
dat_with_descriptions <- data.table::copy(dat)
desc_index <- match(colnames(dat), dictionaries$names$column_name)
desc_colnames <- paste0(colnames(dat), ': ', str_sub(dictionaries$names$description[desc_index],1,30))
colnames(dat_with_descriptions) <- desc_colnames

skimr::skim(keep(dat_with_descriptions, is.numeric))
Table 4.1: Data summary
Name keep(dat_with_description…
Number of rows 222893
Number of columns 34
_______________________
Column type frequency:
numeric 34
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
SPORDER: Person number 0 1 1.76 1.05 1 1 1 2 20 ▇▁▁▁▁
AGEP: Age 0 1 47.51 19.47 15 31 47 63 92 ▇▇▇▆▂
CITWP: Year of naturalization write-i 0 1 154.55 534.17 0 0 0 0 2019 ▇▁▁▁▁
JWMNP: Travel time to work 0 1 14.39 22.03 0 0 5 20 160 ▇▁▁▁▁
JWRIP: Vehicle occupancy 0 1 0.59 0.72 0 0 1 1 10 ▇▁▁▁▁
MARHYP: Year last married 0 1 1430.08 897.96 0 0 1985 2004 2019 ▃▁▁▁▇
WKWN: Weeks worked during past 12 mo 0 1 28.93 24.68 0 0 48 52 52 ▆▁▁▁▇
YOEP: Year of entry 0 1 362.88 769.58 0 0 0 0 2019 ▇▁▁▁▂
PINCP: Total person’s income (signed, 0 1 44006.64 67379.04 -8400 7500 25000 55000 1166600 ▇▁▁▁▁
POVPIP: Income-to-poverty ratio recode 0 1 318.50 172.43 0 171 342 501 501 ▃▃▃▂▇
NP: Number of persons in this hous 0 1 2.98 1.65 1 2 3 4 20 ▇▂▁▁▁
BDSP: Number of bedrooms 0 1 2.87 1.17 0 2 3 4 5 ▂▃▇▅▁
CONP: Condo fee (monthly amount, use 0 1 3.89 49.34 0 0 0 0 1400 ▇▁▁▁▁
ELEP: Electricity cost (monthly cost 0 1 171.63 112.67 0 100 150 230 620 ▇▇▃▁▁
FULP: Fuel cost (yearly cost for fue 0 1 12.37 124.73 0 0 0 0 3000 ▇▁▁▁▁
GASP: Gas cost (monthly cost, use AD 0 1 31.33 51.92 0 0 0 50 390 ▇▁▁▁▁
INSP: Fire/hazard/flood insurance (y 0 1 1010.01 1274.23 0 0 690 1600 7600 ▇▂▁▁▁
MHP: Mobile home costs (yearly amou 0 1 51.71 506.71 0 0 0 0 10800 ▇▁▁▁▁
MRGP: First mortgage payment (monthl 0 1 532.88 882.58 0 0 0 940 5200 ▇▂▁▁▁
RMSP: Number of rooms 0 1 5.96 2.67 0 5 6 7 16 ▂▇▅▁▁
RNTP: Monthly rent (use ADJHSG to ad 0 1 245.55 507.22 0 0 0 0 3100 ▇▁▁▁▁
SMP: Total payment on all second an 0 1 12.03 111.95 0 0 0 0 2600 ▇▁▁▁▁
VALP: Property value 0 1 185236.26 288232.18 0 0 120000 250000 2525000 ▇▁▁▁▁
WATP: Water cost (yearly cost, use A 0 1 571.53 645.74 0 50 370 960 3600 ▇▃▁▁▁
FINCP: Family income (past 12 months, 0 1 87983.56 107745.66 -12600 13900 63000 120000 1524000 ▇▁▁▁▁
GRNTP: Gross rent (monthly amount, us 0 1 290.29 583.55 0 0 0 0 3870 ▇▂▁▁▁
GRPIP: Gross rent as a percentage of 0 1 8.01 18.69 0 0 0 0 101 ▇▁▁▁▁
HINCP: Household income (past 12 mont 0 1 100617.75 105451.17 -12600 37600 74200 127100 1524000 ▇▁▁▁▁
NOC: Number of own children in hous 0 1 0.58 1.04 0 0 0 1 11 ▇▁▁▁▁
NPF: Number of persons in family (u 0 1 2.61 1.91 0 2 2 4 20 ▇▂▁▁▁
NRC: Number of related children in 0 1 0.69 1.12 0 0 0 1 13 ▇▁▁▁▁
OCPIP: Selected monthly owner costs a 0 1 13.35 17.28 0 0 9 19 101 ▇▂▁▁▁
SMOCP: Selected monthly owner costs ( 0 1 947.28 1125.83 0 0 604 1460 11148 ▇▁▁▁▁
TAXAMT: Property taxes (yearly real es 0 1 2929.23 4013.37 0 0 1450 4250 22500 ▇▂▁▁▁

From this we can see that most numeric variables are right skewed, which is consistent with the imputation process assigning 0 to all of their missing fields. Since a 0 also matches the interpretation of these variables ('does not apply'), this is considered data completion rather than data distortion.
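This reading can be checked directly. A minimal sketch, on a hypothetical toy table standing in for the imputed PUMS columns, comparing mean against median and measuring the zero share that imputation introduces:

```r
library(data.table)

# hypothetical toy columns standing in for the imputed PUMS variables
toy <- data.table(CONP = c(0, 0, 0, 0, 1400),
                  ELEP = c(100, 150, 230, 0, 620))

# a mean far above the median flags right skew, and a high zero share
# hints that the skew comes from the 0-imputed 'does not apply' fields
skew_check <- toy[, .(column     = names(.SD),
                      mean       = sapply(.SD, mean),
                      median     = sapply(.SD, median),
                      zero_share = sapply(.SD, function(x) mean(x == 0)))]
```

On this toy table, CONP shows a mean of 280 against a median of 0 with 80% zeros, the same pattern visible in the skim output above.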

We also want to see the correlation between all variables. Due to the large number of variables still present, we opt for an interactive heatmap to glance through the data.

cor_dat <- cor(keep(dat, is.numeric)) %>% 
  round(2) %>% 
  as.data.table(keep.rownames = TRUE) %>% 
  data.table::melt.data.table(id.vars = 'rn', measure.vars = colnames(.)[-1]) %>% 
  setnames(old = c('rn', 'variable'), new = c('x', 'y'))

cor_dat[, x := factor(x, levels = sort(unique(x)))
      ][, y := factor(y, levels = sort(unique(y), decreasing = TRUE))]


(ggplot(data = cor_dat, 
                 mapping = aes(x = x, y = y, fill = value)) +
    geom_tile() +
    theme_minimal() +
    theme(panel.background = element_rect(fill = "transparent", colour = NA),
                   plot.background =  element_rect(fill = "transparent", colour = NA), 
                   axis.title = element_blank(),
                   axis.text.x = element_blank()) +
    viridis::scale_fill_viridis(option = 'magma')) %>% 
  plotly::ggplotly()

4.3 Categorical Variables

For categoricals, a good approach would be to visualize the densities of the target variable for each class, but since we have hundreds of categorical variables, it is not feasible to do this for every single one in the given time frame. We instead manually select a few variables that seem sensible to single out. The selected variables are:

Variable Description
ESR Employment status recode
FS Yearly food stamp/Supplemental Nutrition Assistance Program (SNAP) recipiency
HINS4 Medicaid, Medical Assistance, or any kind of government-assistance plan for those with low incomes or a disability
HISPEED Broadband (high speed) Internet service such as cable, fiber optic, or DSL service
KIT Complete kitchen facilities
MAR Marital status
SCIENGP Field of Degree Science and Engineering Flag - NSF Definition
SCIENGRLP Field of Degree Science and Engineering Related Flag - NSF Definition
SMX Second or junior mortgage or home equity loan status
TEN Tenure
WKL When last worked

We then proceed to make box plots of all the selected variables. Because PINCP has around 10,000 records (5% of all observations) above 150,000, we truncate the charts at that value so that differences in the medians of the categorical groups remain easy to grasp, without the plot being dominated by the few unusually high values.

# call map2 directly: piping categoricals_to_explore in while also naming .x
# would pass the whole vector again as an extra argument to .f
categorical_plots <- purrr::map2(.x = categoricals_to_explore, 
              .y = describe_column(categoricals_to_explore), 
              .f = ~chronicle::make_boxplot(dt = dat[PINCP < 150000],
                                            value = 'PINCP', 
                                            groups = .x,
                                            jitter = FALSE, 
                                            x_axis_label = .y, 
                                            static = FALSE)) %>%
  set_names(categoricals_to_explore)

4.3.1 Employment status recode

4.3.2 Yearly food stamp/Supplemental Nutrition Assistance Program (SNAP) recipiency

4.3.3 Medicaid, Medical Assistance, or any kind of government-assistance plan for those with low incomes or a disability

4.3.4 Broadband (high speed) Internet service such as cable, fiber optic, or DSL service

4.3.5 Complete kitchen facilities

4.3.6 Marital status

4.3.7 Field of Degree Science and Engineering Flag - NSF Definition

4.3.8 Field of Degree Science and Engineering Related Flag - NSF Definition

4.3.9 Second or junior mortgage or home equity loan status

4.3.10 Tenure

4.3.11 When last worked

As expected, most of these variables show a clear separation of medians and quartiles with respect to individual income. These variables were chosen because they are proxies for poverty/quality of life, and have few distinct values, so they can be comfortably visualized.

5 Feature Selection

A good feature selection process should include both subject-matter expertise and sensible statistical behavior. It should also take into account the intended use of the model, and whether or not it will help perpetuate discrimination against protected groups defined by ethnicity, gender, nationality, migratory status, religious belief, disability, and so on. Since the purpose of the model was not disclosed as part of the briefing, these variables will not be taken into account in the study, and will be explicitly removed from the modeling dataset:

protected_categories description
CIT Citizenship status
CITWP Year of naturalization write-in
DREM Cognitive difficulty
HINS7 Indian Health Service
SEX Sex
DIS Disability recode
HISP Recoded detailed Hispanic origin
NATIVITY Nativity
ANC1P Recoded Detailed Ancestry - first entry
ANC2P Recoded Detailed Ancestry - second entry
RAC1P Recoded detailed race code
RAC2P Recoded detailed race code
RAC3P Recoded detailed race code
RACAIAN American Indian and Alaska Native recode (American Indian and Alaska Native alone or in combination with one or more other races)
RACASN Asian recode (Asian alone or in combination with one or more other races)
RACBLK Black or African American recode (Black alone or in combination with one or more other races)
RACNH Native Hawaiian recode (Native Hawaiian alone or in combination with one or more other races)
RACNUM Number of major race groups represented
RACPI Other Pacific Islander recode (Other Pacific Islander alone or in combination with one or more other races)
RACSOR Some other race recode (Some other race alone or in combination with one or more other races)
RACWHT White recode (White alone or in combination with one or more other races)
CPLT Couple Type
FES Family type and employment status
dat <- discard_cols(dat, protected_categories)

We can also build hand-picked variables, which are self-explanatory. We add 1 to NPF so that the respondents themselves are counted along with their family members.

dat[, hu_per_capita_income := HINCP/NP
  ][, fam_per_capita_income := FINCP/(NPF+1)
  ][, fam_children_percentage := fifelse(test = AGEP < 18, 
                                     yes = (NRC+1)/(NPF+1),
                                     no = (NRC)/(NPF+1))
  ][, rooms_per_person := RMSP/NP 
  ][, bedrooms_per_person := BDSP/NP 
  ][, commute_as_perc_worked := JWMNP/WKWN]

Since we are confident in our data cleansing process and in the Census Bureau's good data practices, we can rely on H2O's randomForest heuristics to drop 'bad columns'. The main reason a variable might be dropped is being a categorical with a large number of distinct values, since ensemble tree models tend to memorize the categories and overfit the data even with few trees. We will not apply outlier treatments, since the Bureau already censors extreme values to protect respondent privacy, and since we are going to use a tree-based model, feature scaling is not needed.
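To anticipate which columns those heuristics are likely to flag, we can count the distinct levels of each categorical beforehand. A minimal sketch on a hypothetical toy table (the threshold of 50 is an arbitrary illustration, not H2O's actual cutoff):

```r
library(data.table)

set.seed(1)
# hypothetical toy table: one low-cardinality and one high-cardinality categorical
toy <- data.table(MAR  = sample(c('Married', 'Never married'), 1000, replace = TRUE),
                  OCCP = as.character(sample(1:500, 1000, replace = TRUE)))

# distinct-level count per column; tree ensembles tend to memorize
# high-cardinality categoricals, which is what H2O guards against
cardinality <- toy[, .(column = names(.SD),
                       levels = sapply(.SD, uniqueN))]

high_cardinality <- cardinality[levels > 50]$column  # flags only OCCP
```

Columns surfacing in `high_cardinality` are the ones worth either encoding differently or leaving to H2O's column-dropping heuristics.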

6 Modelling

6.1 H2O local cluster setup

We first set up a local H2O cluster, allocate it 16GB of RAM, and split the data into training and testing sets. The training will be done with cross validation, but we still want to test performance on unseen data.

library(h2o)

#set up the H2O local cluster
h2o.init(max_mem_size = '16G') # allocating 16GB of RAM
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         5 hours 6 minutes 
##     H2O cluster timezone:       America/Regina 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.32.0.1 
##     H2O cluster version age:    4 months and 8 days !!! 
##     H2O cluster name:           H2O_started_from_R_pheym_ydg016 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   12.18 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 4.0.3 (2020-10-10)
h2o.no_progress() # disable progress bars

# feed all variables to the model (except indexes and the target variable)
x <- setdiff(colnames(dat), c('PINCP', 'SERIALNO', 'SPORDER'))
y <- 'PINCP'

# import dataset into h2o cluster
model_dat <- as.h2o(dat)

dat_split <- h2o.splitFrame(data = model_dat, ratios = 0.8, seed = 1)
train <- dat_split[[1]]
test <- dat_split[[2]]

We start off with a random forest model using H2O's very reasonable preset values (ntrees = 50, max_depth = 20, nbins = 20, no early stopping).

6.2 Initial model

# build a cross validated model
rf <- h2o.randomForest(x = x,
                       y = y,
                       training_frame = train,
                       nfolds = 10, 
                       max_runtime_secs = 100, 
                       seed = 1)
rf@model$model_summary %>% knitr::kable()
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
4 4 1594272 20 20 20 30017 33530 31702.75

And we can see that the cross validation metrics are stable:

rf@model$cross_validation_metrics_summary %>% knitr::kable()
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid cv_6_valid cv_7_valid cv_8_valid cv_9_valid cv_10_valid
mae 17412.41 109.89248 17429.164 17466.066 17406.348 17540.42 17543.006 17301.305 17516.516 17249.408 17263.135 17408.734
mean_residual_deviance 1.66010061E9 4.2154416E7 1.65868275E9 1.6924E9 1.6783031E9 1.71451878E9 1.69212698E9 1.61251763E9 1.67506803E9 1.58892877E9 1.60836902E9 1.6800905E9
mse 1.66010061E9 4.2154416E7 1.65868275E9 1.6924E9 1.6783031E9 1.71451878E9 1.69212698E9 1.61251763E9 1.67506803E9 1.58892877E9 1.60836902E9 1.6800905E9
r2 0.63515854 0.012307591 0.6388053 0.64747477 0.610129 0.61927164 0.63934875 0.64385724 0.6270948 0.64332086 0.64453024 0.6377529
residual_deviance 1.66010061E9 4.2154416E7 1.65868275E9 1.6924E9 1.6783031E9 1.71451878E9 1.69212698E9 1.61251763E9 1.67506803E9 1.58892877E9 1.60836902E9 1.6800905E9
rmse 40741.355 519.1445 40726.93 41138.79 40967.098 41406.746 41135.47 40156.164 40927.598 39861.37 40104.477 40988.906
rmsle NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

And we can see that the performance on unseen data is virtually identical to the cross validation metrics. The error metrics seem high, however, especially MAE and RMSE in the context of the income bands we saw during variable exploration.

# check performance over test set
rf@model$cross_validation_metrics
## H2ORegressionMetrics: drf
## ** Reported on cross-validation data. **
## ** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  1659915543
## RMSE:  40742.06
## MAE:  17412.05
## RMSLE:  NaN
## Mean Residual Deviance :  1659915543
h2o.performance(model = rf, newdata = test)
## H2ORegressionMetrics: drf
## 
## MSE:  1656260021
## RMSE:  40697.17
## MAE:  17363.03
## RMSLE:  NaN
## Mean Residual Deviance :  1656260021

6.3 Second model

We will try to improve the model by letting it fit the data with deeper trees and more run time.

rf <- h2o.randomForest(x = x,
                       y = y,
                       training_frame = train,
                       nfolds = 10, 
                       max_runtime_secs = 600, 
                       seed = 1, 
                       max_depth = 30)

rf@model$model_summary %>% knitr::kable()
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
11 11 9811838 30 30 30 65796 73579 70857.09
# check performance over test set
rf@model$cross_validation_metrics
## H2ORegressionMetrics: drf
## ** Reported on cross-validation data. **
## ** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  1598117362
## RMSE:  39976.46
## MAE:  17511.41
## RMSLE:  NaN
## Mean Residual Deviance :  1598117362
h2o.performance(model = rf, newdata = test)
## H2ORegressionMetrics: drf
## 
## MSE:  1571089302
## RMSE:  39636.97
## MAE:  17468.21
## RMSLE:  NaN
## Mean Residual Deviance :  1571089302

This does not yield significant improvements, and the scoring history suggests that performance may not improve without far more extensive training time, which is not available for this delivery.

rf@model$scoring_history %>% knitr::kable()
timestamp duration number_of_trees training_rmse training_mae training_deviance
2021-02-17 08:18:34 9 min 29.722 sec 0 NaN NaN NaN
2021-02-17 08:18:40 9 min 35.406 sec 1 51606.63 22001.29 2663244575
2021-02-17 08:18:45 9 min 40.862 sec 2 50391.92 21279.85 2539345351
2021-02-17 08:18:51 9 min 46.431 sec 3 49193.32 20787.89 2419982874
2021-02-17 08:18:56 9 min 51.837 sec 4 48111.16 20386.16 2314683839
2021-02-17 08:19:02 9 min 57.417 sec 5 47113.95 19990.21 2219723952
2021-02-17 08:19:07 10 min 2.676 sec 6 46256.33 19669.67 2139648160
2021-02-17 08:19:12 10 min 8.220 sec 7 45376.69 19344.02 2059043642
2021-02-17 08:19:17 10 min 13.275 sec 8 44642.62 19120.42 1992963291
2021-02-17 08:19:23 10 min 18.835 sec 9 44234.08 18958.57 1956653576
2021-02-17 08:19:28 10 min 24.261 sec 10 43792.05 18780.36 1917743854
2021-02-17 08:19:34 10 min 30.072 sec 11 43372.88 18647.31 1881206870

6.4 Brute forcing through H2O's AutoML

We can then check whether the limiting factor is the set of models and training time available, by letting H2O's AutoML function search for an optimal model.

aml <- h2o.automl(x = x,
                  y = y,
                  training_frame = train,
                  nfolds = 10, 
                  max_runtime_secs = 600, 
                  seed = 1)
## 
## 08:19:36.150: AutoML: XGBoost is not available; skipping it.
aml@leaderboard %>% knitr::kable()
model_id mean_residual_deviance rmse mse mae rmsle
StackedEnsemble_AllModels_AutoML_20210217_081936 1243525587 35263.66 1243525587 16669.43 NA
StackedEnsemble_BestOfFamily_AutoML_20210217_081936 1272165073 35667.42 1272165073 16712.49 NA
GBM_grid__1_AutoML_20210217_081936_model_2 1295137783 35988.02 1295137783 16570.12 NA
GBM_grid__1_AutoML_20210217_081936_model_1 1322990856 36372.94 1322990856 18311.98 NA
GBM_1_AutoML_20210217_081936 1352088446 36770.76 1352088446 18732.81 NA
DeepLearning_1_AutoML_20210217_081936 1381936243 37174.40 1381936243 18363.11 NA
GBM_2_AutoML_20210217_081936 1389666399 37278.23 1389666399 19184.51 NA
GBM_3_AutoML_20210217_081936 1436199465 37897.22 1436199465 19808.92 NA
DeepLearning_grid__1_AutoML_20210217_081936_model_1 1456933035 38169.79 1456933035 19275.38 NA
DeepLearning_grid__2_AutoML_20210217_081936_model_1 1673543654 40908.97 1673543654 26837.61 NA
GBM_4_AutoML_20210217_081936 1698063678 41207.57 1698063678 22330.91 NA
DeepLearning_grid__1_AutoML_20210217_081936_model_2 1773730116 42115.68 1773730116 22247.37 NA
DeepLearning_grid__3_AutoML_20210217_081936_model_1 1895573929 43538.19 1895573929 31077.77 NA
GBM_5_AutoML_20210217_081936 2224216242 47161.60 2224216242 26390.10 NA
XRT_1_AutoML_20210217_081936 2328528204 48254.83 2328528204 19664.68 NA
DRF_1_AutoML_20210217_081936 2397969527 48969.07 2397969527 19656.57 NA
GBM_grid__1_AutoML_20210217_081936_model_3 3479278863 58985.41 3479278863 33755.86 NA
GLM_1_AutoML_20210217_081936 4552838436 67474.72 4552838436 38936.09 NA
aml@leader@model$model_summary
## Number of Base Models: 16
## 
## Base Models (count by algorithm type):
## 
## deeplearning          drf          gbm          glm 
##            5            2            8            1 
## 
## Metalearner:
## 
## Metalearner algorithm: glm
## Metalearner cross-validation fold assignment:
##   Fold assignment scheme: AUTO
##   Number of folds: 10
##   Fold column: NULL
## Metalearner hyperparameters:

While the model complexity skyrockets, the performance metrics do not show an improvement meaningful enough to justify the additional complexity. We can then conclude that to improve the model, we would need better feature engineering, or to reincorporate some of the income variables to use as proxies for our target.

We then proceed to finish the model performance analysis with our random forest.

6.5 Contextualizing the error metrics

While MAE and RMSE give an idea of the range of values that might be predicted, they are not ideal for understanding the fit across different income bands, since being 17,000 USD off means quite different things for people in the first versus the last quintile. Hence, we propose using the median relative error to see the overall performance across different income bands.

# predict over the test set
pred <- h2o.predict(object = rf, 
                    newdata = test)

# compute relative error for each observation
pred_frame <- cbind(as.data.table(test)[, .(PINCP)],
                    as.data.table(pred)
                    )[, absolute_error := abs(predict-PINCP)
                    ][, relative_error := absolute_error/PINCP]

# create a grouping column for income quintiles
discretize_column(dt = pred_frame,
                  amount_column = 'PINCP',
                  band_cuts = quantile(pred_frame$PINCP, probs = seq(0, 1, .2)), 
                  bands_column_name = 'income_band')

# check the performance in each quintile
pred_frame[, .(population_size = .N,
               median_absolute_error = median(absolute_error),
               median_relative_error = median(relative_error)),
           by = income_band][order(income_band)] %>% knitr::kable()
income_band population_size median_absolute_error median_relative_error
-6300-3500 8899 1299.585 NA
3500-17400 8893 4269.091 0.4482762
17400-35000 8658 5387.825 0.2148308
35000-65000 9088 8769.500 0.1886000
65000-1142000 8966 26504.545 0.2600342

The model sees modestly good results in the upper income bands, but does quite poorly in the lower ones. The model had smaller absolute errors on the lower bands, but not small enough to overcome the heavier penalty that dividing by small incomes entails for relative error; in the lowest band, relative error is undefined at zero income, which renders that band's median NA.

And to see the performance without good and bad results evening each other out, we can draw a line in the sand by defining a 'good' result as a predicted value with a relative error below .15, and look at the percentage of good results in each income band.

# flag good predictions
pred_frame[, good_prediction := relative_error < .15]

pred_frame[, .(perc_good_pred = mean(good_prediction)), by = income_band][order(income_band)] %>% knitr::kable()
income_band perc_good_pred
-6300-3500 NA
3500-17400 0.3241876
17400-35000 0.4243474
35000-65000 0.4477333
65000-1142000 0.3456391

In conclusion, our model is modestly competent and stable across the higher positive income bands, while taking a significant hit on the lower income bands.

To finalize, we review which variables the model took into account, by examining the model's variable importance ranking.

var_importance <- as.data.table(h2o.varimp(rf))[, cumulative_perc := cumsum(percentage)
                                              ][, relative_importance := NULL
                                              ][, description := describe_column(variable)]

head(var_importance, 10) %>% knitr::kable()
variable scaled_importance percentage cumulative_perc description
HINCP 1.0000000 0.1579654 0.1579654 Household income (past 12 months, use ADJINC to adjust HINCP to constant dollars)
hu_per_capita_income 0.8931361 0.1410846 0.2990500 Not on ACS dictionary
fam_per_capita_income 0.7916091 0.1250468 0.4240968 Not on ACS dictionary
POVPIP 0.6112539 0.0965570 0.5206537 Income-to-poverty ratio recode
WKWN 0.6030458 0.0952604 0.6159141 Weeks worked during past 12 months
AGEP 0.4140549 0.0654063 0.6813204 Age
FINCP 0.3997472 0.0631462 0.7444667 Family income (past 12 months, use ADJINC to adjust FINCP to constant dollars)
JWMNP 0.1845843 0.0291579 0.7736246 Travel time to work
commute_as_perc_worked 0.1721979 0.0272013 0.8008259 Not on ACS dictionary
VALP 0.1536660 0.0242739 0.8250998 Property value

The model relied heavily on the household and family incomes, and on their respective per-capita approximations; these were expected to be our most robust proxies for personal income. They are followed by weeks worked, age, and property value, all of which can reasonably be expected to correlate positively with individual income.

7 Closing Thoughts

Overall, we have obtained an interesting baseline for individual income prediction, and with the ACS richness in data, there is a lot of potential to improve upon these current findings. Some suggestions:

  • Cross worked hours with average salaries by occupation to distinguish the value of those hours.
  • Perform more complex household metrics like the count of different profiles by age, occupation, education level, and so on. For example, the count and percentage of college educated adults by household.
  • Use time of arrival and departure to distinguish night shifts, part-time and freelance jobs.
  • Extend training time from minutes to hours.
  • Use grid search methods for finding optimal forest parameters.
  • Train several models to predict each income separately, and then add their predictions.

8 Appendix

8.1 The {chronicle} package

This package was developed by Philippe Heymans, and aims to provide a seamless experience for building sophisticated and interactive visualizations in R, handling most of their challenges through its collection of helper functions. For further detail, please visit its GitHub page.

8.2 Helper functions

This is a small collection of facilitator functions meant to make the output more readable and avoid distractions from formatting or complex code. With very little effort, these could be built into an R package providing one-liner custom solutions for PUMS data processing.

# fetch the column description from the NAME entries of the data dictionary
describe_column <- function(col = NULL){
  descrip_index <- match(col, dictionaries$names$column_name)
  descr <- dictionaries$names$description[descrip_index]
  
  if(any(is.na(descr))){
    descr[is.na(descr)] <- 'Not on ACS dictionary'
  }
  return(descr)
}

# run summary statistics over a numerical column of a data.table (optionally grouped)
num_summary <- function(dt, col, by = NULL){
  # rename the column to a generic variable (to avoid eval() or get() calls)
  data.table::setnames(x = dt, old = col, new = 'value')
  
  # perform the computations
  summ <- dt[, .(count = sum(!is.na(value)),
               distinct_values = uniqueN(value),
               na_count = sum(is.na(value)),
               na_perc = sum(is.na(value))/length(value),
               sum = sum(value, na.rm = TRUE),
               mean = mean(value, na.rm = TRUE),
               sd = sd(value, na.rm = TRUE),
               min = min(value, na.rm = TRUE),
               percentile.25 = quantile(value, 0.25, na.rm = TRUE),
               percentile.50 = quantile(value, 0.5, na.rm = TRUE),
               percentile.75 = quantile(value, 0.75, na.rm = TRUE),
               percentile.975 = quantile(value, 0.975, na.rm = TRUE),
               percentile.99 = quantile(value, 0.99, na.rm = TRUE),
               max = max(value, na.rm = TRUE)),
             by = by]
  
  # rename the column back to its original name
  data.table::setnames(x = dt, old = 'value', new = col)
  
  return(summ)
}

# compute the count and percentage of missing entries for each column of a data.frame 
na_percentage <- function(dt){
  data.table(column = colnames(dt),
             na_count = purrr::map_int(.x = dt, 
                                       .f = ~sum(is.na(.x))
                                       ))[, na_perc := na_count/nrow(dt)][]
}

# removes all columns matching a character vector (optionally regular expressions)
discard_cols <- function(dt, cols, regex = FALSE){
  if(!regex){
    dt %>% purrr::discard(colnames(.) %in% cols)
  }else{
    dt %>% purrr::discard(str_detect(colnames(.), paste(cols, collapse = '|')))
  }
}
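A quick usage sketch of discard_cols on a hypothetical toy table, showing both the exact-name path and the regex path (the latter is how the replicate weight columns are dropped in pums_for_individual_modelling):

```r
library(data.table)
library(magrittr)
library(purrr)
library(stringr)

# discard_cols as defined above, reproduced so the sketch runs standalone
discard_cols <- function(dt, cols, regex = FALSE){
  if(!regex){
    dt %>% purrr::discard(colnames(.) %in% cols)
  }else{
    dt %>% purrr::discard(str_detect(colnames(.), paste(cols, collapse = '|')))
  }
}

toy <- data.table(PWGTP1 = 1:3, PWGTP2 = 4:6, AGEP = c(25, 40, 61))

colnames(discard_cols(toy, cols = 'AGEP'))                  # drops AGEP only
colnames(discard_cols(toy, cols = '^PWGTP', regex = TRUE))  # drops both weight columns
```

Note that purrr::discard treats a data.frame as a list of columns, which is what makes this a column filter rather than a row filter.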


# fill the NAs of the specified columns by reference
na_fill <- function(dt, cols, replacement = 0) {
  # match the indexes of the columns specified by the user
  index <- match(cols, colnames(dt))
  purrr::walk(index, ~data.table::set(dt, which(is.na(dt[[.x]])), .x, replacement))
}
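And a usage sketch for na_fill, again on a hypothetical toy table. Since data.table::set modifies the table in place, there is nothing to assign back:

```r
library(data.table)
library(purrr)

# na_fill as defined above, reproduced so the sketch runs standalone
na_fill <- function(dt, cols, replacement = 0) {
  index <- match(cols, colnames(dt))
  purrr::walk(index, ~data.table::set(dt, which(is.na(dt[[.x]])), .x, replacement))
}

toy <- data.table(ELEP = c(100, NA, 230),
                  TEN  = c(NA, 'Owned', 'Rented'))

na_fill(toy, cols = 'ELEP', replacement = 0)    # numeric imputation
na_fill(toy, cols = 'TEN',  replacement = 'b')  # explicit 'does not apply' marker

# toy now holds ELEP = 100, 0, 230 and TEN = 'b', 'Owned', 'Rented'
```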


# parse the two separate specifications entangled in the ACS dictionary file
parse_dictionaries <- function(dict_filename){
  dictionaries <- readr::read_lines(dict_filename) %>%
    unique() %>%
    split(stringr::str_detect(., '^NAME')) %>% # identify each dataset
    purrr::set_names(c('values', 'names')) %>%
    purrr::map2(.y = list(val_columns = c('field_type',
                                          'column_name',
                                          'column_type',
                                          'field_length',
                                          'starting_value',
                                          'ending_value',
                                          'description'),
                          name_columns = c('field_type',
                                           'column_name',
                                           'column_type',
                                           'field_length',
                                           'description')),
                .f = ~data.table::fread(text = c(paste(.y, collapse = ','),
                                                 .x)))
  
  # find the index of SPORDER in both the $names and $values tables
  sporder_index <- dictionaries %>%
    purrr::map(~which(.x$column_name %chin% 'SPORDER'))
  
  # map the corresponding dataset flag to each row
  walk2(.x = dictionaries,
        .y = sporder_index,
        .f = ~.x[, record_number := 1:.N
        ][, dataset := data.table::fifelse(test = record_number < .y,
                                           yes = 'H',
                                           no = 'P')])
  return(dictionaries)
}

# decode categorical columns with the values provided in the VAL dictionary.
decode_categoricals <- function(col, dat){
  # select the dictionary entries specifying column values
  column_mapping <- dictionaries$values[column_name %chin% col & starting_value == ending_value,
                                        .(starting_value, description)] %>% 
    data.table::setkey(starting_value)
  
  # data.table 'left join' by reference
  # technically this is a mutate by reference, not a join
  matching_statement <- paste0(col, '==', 'starting_value')
  setkeyv(dat, col)
  dat[column_mapping, on = matching_statement, (col) := i.description]
}

pums_for_individual_modelling <- function(person_filename,
                                          housing_unit_filename,
                                          dictionary_filename){

  tictoc::tic('PUMS files successfully parsed')
  # 0. load and join data ---
    
  # load person data
  per <- data.table::fread(person_filename) %>% 
    discard_cols(cols = 'RT')

  # load housing unit data
  hu <- data.table::fread(housing_unit_filename) %>% 
    discard_cols(cols = 'RT')

  dictionaries <- parse_dictionaries(dictionary_filename)
  
  join_cols <- intersect(colnames(per), colnames(hu))
  # this is c("SERIALNO", "DIVISION", "PUMA", "REGION", "ST", "ADJINC")
  
  # set data.table keys for fast joining
  data.table::setkeyv(per, join_cols)
  data.table::setkeyv(hu,  join_cols)
  
  # data.table left join
  pums <- data.table::merge.data.table(x = per, hu, all.x = TRUE)
  
  character_cols <- dictionaries$names[column_type == 'C']$column_name %>% intersect(colnames(pums))
  pums[, (character_cols) := purrr::map(.SD, as.character), .SDcols = character_cols]
  
  # remove the separate tables from memory
  rm(per)
  rm(hu)
  
  # 1. discard columns ----
  
  # weight columns
  pums <- discard_cols(dt = pums, cols = c('^WGTP', '^PWGTP'), regex = TRUE)
  
  # flag columns
  flag_columns <- dictionaries$names[str_detect(description, 'allocation flag')]
  pums <- discard_cols(dt = pums, cols = flag_columns, regex = TRUE)
  
  # constant_columns
  unique_values <- data.table(column = colnames(pums),
                              unique_values = purrr::map_int(pums, uniqueN)
                              )[unique_values == 1]
  
  pums <- discard_cols(dt = pums, cols = unique_values$column)
  
  # 2. data imputation ----
  numeric_impute <- dictionaries$names[column_type == 'N']$column_name %>%
    intersect(colnames(pums)) %>% 
    setdiff('PINCP') # our target variable deserves special consideration 
  
  na_fill(dt = pums,  
          cols = numeric_impute,
          replacement = 0)
  
  # for categoricals, only the variables that have an explicit 'b' = 'Does not apply' field will be replaced.
  categorical_impute <- dictionaries$names[column_type == 'C']$column_name %>%
    intersect(colnames(pums)) %>% 
    intersect(dictionaries$values[column_type == 'C' & starting_value == 'b']$column_name)
  
  na_fill(dt = pums,  
          cols = categorical_impute,
          replacement = 'b')
  
  
  # 3. categorical variable decoding ----
  
  # select the categorical variables according to the dictionary
  columns_to_decode <- dictionaries$names[column_type %chin% 'C']$column_name 
  
  # do not decode range categorical variables (i.e. where starting_value != ending_value)
  # These are c("SERIALNO", "PUMA", "MIGPUMA", "POWPUMA", "RACNUM")
  exclude_from_decoding <- 
    dictionaries$values[column_type %chin% 'C' & starting_value != ending_value]$column_name
  
  columns_to_decode %<>% setdiff(exclude_from_decoding) 
  
  # and remove the columns we already removed from the dataset
  columns_to_decode %<>% intersect(colnames(pums))
  
  # set all categorical variables as character vectors to avoid type inconsistencies 
  pums[, (columns_to_decode) := purrr::map(.SD, as.character), .SDcols = columns_to_decode]
  
  columns_to_decode %>% walk(decode_categoricals, dat = pums)
  
  tictoc::toc()
  
  return(pums[])
}

# create a new column marking the value band of each observation wrt a continuous column 
discretize_column <- function(dt,
                              amount_column,
                              band_cuts,
                              bands_column_name) {
  
  amount_column %<>% as.character()
  
  # create empty column
  data.table::setDT(dt)
  dt[, (bands_column_name) := NA_character_]
  
  # build index for the bands
  band_length <- length(band_cuts)
  run <- seq_len(band_length - 1) # one band fewer than cut points
  
  # create names for the bands
  band_names <- paste0(band_cuts[-band_length], '-', band_cuts[-1])
  
  # assign band to each value
  purrr::walk(run, ~data.table::set(dt,
                                    which(band_cuts[.x] <= dt[[amount_column]] &
                                            dt[[amount_column]] <= band_cuts[.x + 1]),
                                    bands_column_name,
                                    band_names[.x])[])
  
  dt[, (bands_column_name) := factor(x = get(bands_column_name), levels = band_names)]
}
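As a sanity check on the banding logic, equivalent bands can be produced with base cut() on toy data. Boundary handling differs slightly: discretize_column assigns a value sitting exactly on a cut to the later band, while cut() with the default right = TRUE assigns it to the earlier one.

```r
library(data.table)

# hypothetical toy incomes and the quintile cuts seen in section 6.5
toy  <- data.table(PINCP = c(-5000, 2000, 20000, 50000, 900000))
cuts <- c(-6300, 3500, 17400, 35000, 65000, 1142000)

# name each band 'lower-upper', as discretize_column does
band_names <- paste0(cuts[-length(cuts)], '-', cuts[-1])

toy[, income_band := cut(PINCP, breaks = cuts,
                         labels = band_names, include.lowest = TRUE)]
# -5000 lands in '-6300-3500' and 900000 in '65000-1142000'
```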

8.3 Session Info

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] h2o_3.32.0.1      skimr_2.1.2       readr_1.4.0       stringr_1.4.0     purrr_0.3.4       ggplot2_3.3.3     knitr_1.31       
##  [8] chronicle_0.2.5   data.table_1.13.6 magrittr_2.0.1    rlang_0.4.10     
## 
## loaded via a namespace (and not attached):
##  [1] highr_0.8         pillar_1.4.7      compiler_4.0.3    bitops_1.0-6      rmdformats_1.0.1  base64enc_0.1-3   viridis_0.5.1    
##  [8] tools_4.0.3       bit_4.0.4         digest_0.6.27     viridisLite_0.3.0 jsonlite_1.7.2    evaluate_0.14     lifecycle_0.2.0  
## [15] tibble_3.0.4      gtable_0.3.0      pkgconfig_2.0.3   rstudioapi_0.13   crosstalk_1.1.0.1 yaml_2.2.1        xfun_0.20        
## [22] gridExtra_2.3     withr_2.3.0       httr_1.4.2        repr_1.1.3        dplyr_1.0.2       htmlwidgets_1.5.3 generics_0.1.0   
## [29] vctrs_0.3.6       tictoc_1.0        hms_0.5.3         bit64_4.0.5       grid_4.0.3        tidyselect_1.1.0  glue_1.4.2       
## [36] R6_2.5.0          plotly_4.9.3      bookdown_0.21     rmarkdown_2.6     farver_2.0.3      tidyr_1.1.2       scales_1.1.1     
## [43] ellipsis_0.3.1    htmltools_0.5.0   colorspace_2.0-0  labeling_0.4.2    stringi_1.5.3     RCurl_1.98-1.2    lazyeval_0.2.2   
## [50] munsell_0.5.0     crayon_1.3.4